Abstract

Background: Progressive methods offer efficient and reasonably good solutions to the multiple sequence alignment
problem. However, resulting alignments are biased by guide-trees, especially for relatively distant sequences.

Results: We propose MSARC, a new graph-clustering based algorithm that aligns sequence sets without guide-trees. 
Experiments on the BAliBASE dataset show that MSARC achieves alignment quality similar to the best progressive
methods. 

Furthermore, MSARC outperforms them on sequence sets whose evolutionary distances are difficult to represent by a 
phylogenetic tree. These datasets are most exposed to the guide-tree bias of alignments. 

Availability: MSARC is available at http://bioputer.mimuw.edu.pl/msarc 
Keywords: Multiple sequence alignment, Stochastic alignment, Graph partitioning 



Background 

Determining the alignment of a group of biological 
sequences is among the most common problems in com- 
putational biology. The dynamic programming method of 
pairwise sequence alignment can be readily extended to 
multiple sequences but requires the computation of an n- 
dimensional matrix to align n sequences. Since the size of 
such a matrix is exponential with respect to n, the time 
and space complexity of this method is exponential too. 

Progressive alignment [1] offers a substantial complexity 
reduction at the cost of possible loss of the optimal solu- 
tion. Within this approach, subset alignments are sequen- 
tially pairwise aligned to build the final multiple align- 
ment. The order of pairwise alignments is determined by 
a guide-tree representing the phylogenetic relationships 
between sequences. 

There are two drawbacks of the progressive alignment 
approach. First, the accuracy of the guide-tree affects the 
quality of the final alignment. This problem is particu- 
larly important in the field of phylogeny reconstruction, 
because multiple alignment acts as a preprocessing step 
in most prominent methods of inferring a phylogenetic 
tree of sequences. It has been shown that, within this 
approach, the inferred phylogeny is biased towards the
initial guide-tree [2,3]. 

Second, only sequences belonging to currently aligned 
subsets contribute to their pairwise alignment. Even if 
a guide-tree reflects correct phylogenetic relationships, 
these alignments may be inconsistent with remaining 
sequences and the inconsistencies are propagated to fur- 
ther steps. To address this problem, in recent programs 
[4-8] progressive alignment is usually preceded by consis- 
tency transformation (incorporating information from all 
pairwise alignments into the objective function) and/or 
followed by iterative refinement of the multiple alignment 
of all sequences. Moreover, recently several strategies 
avoiding guide trees altogether were also proposed [9-11]. 

In the present paper we propose MSARC, a new 
non-progressive multiple sequence alignment algorithm. 
MSARC constructs a graph with all residues from all 
sequences as nodes and edges weighted with alignment 
affinities of its adjacent nodes. Columns of best multi- 
ple alignments tend to form clusters in this graph, so in 
the next step residues are clustered (see Figure 1). Finally, 
MSARC refines the multiple alignment corresponding to 
the clustering. 

Experiments on the BAliBASE dataset [12] show that
our approach is competitive with the best progressive
methods and significantly outperforms most non-progressive
algorithms. Moreover, MSARC is the best
aligner for sequence sets with very low levels of conservation.
This feature makes MSARC a promising preprocessing
tool for phylogeny reconstruction pipelines.

Figure 1 Alignment graph and its desired clustering. Clusters
form columns of a corresponding multiple sequence alignment.

Methods 

MSARC aligns sequence sets in several steps. In a prepro- 
cessing step, following Probalign [8], stochastic alignments 
are calculated for all pairs of sequences and consistency 
transformation is applied to resulting posterior probabil- 
ities of residue correspondences. Transformed probabili- 
ties, called residue alignment affinities, represent weights 
of an alignment graph (see Endnote).

MSARC clusters this graph with a top-down hierarchi- 
cal method (Figure 2). Division steps are based on the 
Fiduccia-Mattheyses graph partitioning algorithm [13], 
adapted to satisfy constraints imposed by the sequence 
order of residues. Finally, the multiple alignment corre- 
sponding to the resulting clustering is refined with the 
iterative improvement strategy proposed in Probcons [7], 
adapted to remove clustering artefacts. 

Pairwise stochastic alignment 

The concept of stochastic (or probability) alignment was 
proposed in [14]. Given a pair of sequences, this frame- 
work defines statistical weights of their possible align- 
ments. Based on these weights, for each pair of residues 
from both sequences, the posterior probability of being 
aligned may be computed. 

A consensus of highly weighted suboptimal alignments 
was shown to contain pairs with significant probabilities 
that agree with structural alignments despite the optimal
alignment deviating significantly. Mückstein et al. [15]
suggest the use of the method as a starting point for 
improved multiple sequence alignment procedures. 

The statistical weight $W(A)$ of an alignment $A$ is the
product of the individual weights of (mis-)matches and
gaps [16]. It may be obtained from the standard similarity
scoring function $S(A)$ with the following formula:

$$W(A) = e^{\beta S(A)} \tag{1}$$

where $\beta$ corresponds to the inverse of Boltzmann's constant
and should be adjusted to the match/mismatch scoring
function $s(x,y)$ (in fact, $\beta$ simply rescales the scoring
function).

The probability distribution over all alignments $\mathcal{A}^*$ is
achieved by normalizing this value. The normalization
factor $Z$ is called the partition function of the alignment
problem [14], and is defined as

$$Z = \sum_{A \in \mathcal{A}^*} W(A) = \sum_{A \in \mathcal{A}^*} e^{\beta S(A)} \tag{2}$$

The probability $P(A)$ of an alignment can be calculated by

$$P(A) = \frac{W(A)}{Z} = \frac{e^{\beta S(A)}}{Z} \tag{3}$$

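Let $P(a_i \sim b_j)$ denote the posterior probability that
residues $a_i$ and $b_j$ are aligned.

We can calculate it as the sum of probabilities of all
alignments with $a_i$ and $b_j$ in a common column (denoted
by $\mathcal{A}^*_{a_i \sim b_j}$):

$$P(a_i \sim b_j) = \sum_{A \in \mathcal{A}^*_{a_i \sim b_j}} P(A) = \frac{\sum_{A \in \mathcal{A}^*_{a_i \sim b_j}} e^{\beta S(A)}}{Z} = \frac{\left(\sum_{A_{i-1,j-1}} e^{\beta S(A_{i-1,j-1})}\right) e^{\beta s(a_i,b_j)} \left(\sum_{\hat{A}_{i+1,j+1}} e^{\beta S(\hat{A}_{i+1,j+1})}\right)}{Z} = \frac{Z_{i-1,j-1}\, e^{\beta s(a_i,b_j)}\, \hat{Z}_{i+1,j+1}}{Z} \tag{4}$$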



Here we use the notation $A_{i,j}$ for an alignment of the
sequence prefixes $a_1 \cdots a_i$ and $b_1 \cdots b_j$, and $\hat{A}_{i,j}$ for an
alignment of the sequence suffixes $a_i \cdots a_m$ and $b_j \cdots b_n$.
Analogously, $Z_{i,j}$ is the partition function over the prefix
alignments and $\hat{Z}_{i,j}$ is the (reverse) partition function over
the suffix alignments.

An efficient algorithm for calculating the partition func- 
tion can be derived from the Gotoh maximum score 
algorithm [17] by replacing the maximum operations with 
additions [14-16]: 



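$$Z^{M}_{i,j} = Z_{i-1,j-1}\, e^{\beta s(a_i,b_j)} \tag{5}$$

$$Z^{E}_{i,j} = \left(Z^{M}_{i,j-1} + Z^{F}_{i,j-1}\right) e^{\beta g_o} + Z^{E}_{i,j-1}\, e^{\beta g_{ext}} \tag{6}$$

$$Z^{F}_{i,j} = \left(Z^{M}_{i-1,j} + Z^{E}_{i-1,j}\right) e^{\beta g_o} + Z^{F}_{i-1,j}\, e^{\beta g_{ext}} \tag{7}$$

$$Z_{i,j} = Z^{M}_{i,j} + Z^{E}_{i,j} + Z^{F}_{i,j} \tag{8}$$

where $g_o$ and $g_{ext}$ denote the gap opening and gap extension penalties.
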
Figure 2 Hierarchical divisive clustering of residues. An alignment graph is recursively partitioned by finding a balanced minimal cut while
maintaining the ordering of residues until all parts have at most one residue from each sequence. The final alignment is constructed by 
concatenating these parts (alignment columns) from left to right. 



The reverse partition function can be calculated using the 
same recursion in reverse, starting from the ends of the 
aligned sequences. 

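We also considered a slight modification of formulas 6
and 7:

$$Z^{E}_{i,j} = Z^{M}_{i,j-1}\, e^{\beta g_o} + Z^{E}_{i,j-1}\, e^{\beta g_{ext}} \tag{9}$$

$$Z^{F}_{i,j} = Z^{M}_{i-1,j}\, e^{\beta g_o} + Z^{F}_{i-1,j}\, e^{\beta g_{ext}} \tag{10}$$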

In this case insertions and deletions must be separated by 
at least one match/mismatch position. This variant was 
proposed by Miyazawa [14] and applied in the Probalign 
[8] and MSAProbs [18] aligners. 
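
The recursions of equations (5)-(8) and the posterior of equation (4) can be illustrated with a short Python sketch. It is not the MSARC implementation: the function names `partition_table` and `posteriors` and the toy scoring function in the usage note are hypothetical, terminal gaps are penalized like internal ones for brevity, and the reverse partition function is obtained by running the same recursion on the reversed sequences.

```python
import math

def partition_table(a, b, score, beta, g_open, g_ext):
    """Forward partition function Z[i][j] over alignments of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    zero = lambda: [[0.0] * (n + 1) for _ in range(m + 1)]
    ZM, ZE, ZF, Z = zero(), zero(), zero(), zero()
    ZM[0][0] = Z[0][0] = 1.0  # the empty alignment has weight 1
    w_open, w_ext = math.exp(beta * g_open), math.exp(beta * g_ext)
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:  # a_i aligned with b_j
                ZM[i][j] = Z[i - 1][j - 1] * math.exp(beta * score(a[i - 1], b[j - 1]))
            if j > 0:  # b_j aligned with a gap
                ZE[i][j] = (ZM[i][j - 1] + ZF[i][j - 1]) * w_open + ZE[i][j - 1] * w_ext
            if i > 0:  # a_i aligned with a gap
                ZF[i][j] = (ZM[i - 1][j] + ZE[i - 1][j]) * w_open + ZF[i - 1][j] * w_ext
            Z[i][j] = ZM[i][j] + ZE[i][j] + ZF[i][j]
    return Z

def posteriors(a, b, score, beta=0.2, g_open=-22.0, g_ext=-1.0):
    """Matrix of P(a_i ~ b_j): prefix weight * match weight * suffix weight / Z."""
    m, n = len(a), len(b)
    fwd = partition_table(a, b, score, beta, g_open, g_ext)
    # Suffix tables: the same recursion applied to the reversed sequences.
    rev = partition_table(a[::-1], b[::-1], score, beta, g_open, g_ext)
    total = fwd[m][n]
    return [[fwd[i - 1][j - 1] * math.exp(beta * score(a[i - 1], b[j - 1]))
             * rev[m - i][n - j] / total
             for j in range(1, n + 1)]
            for i in range(1, m + 1)]
```

For example, `posteriors("ACGT", "ACGT", lambda x, y: 5.0 if x == y else -4.0)` returns a 4 x 4 matrix whose diagonal entries are close to 1, since the ungapped identity alignment dominates the partition function.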

Alignment graphs 

Let us regard probabilities $P(a_i \sim b_j)$ as a representation
of a bipartite graph with weighted edges, i.e. a graph with
residues from both sequences as nodes and edges joining
each $a_i$ with each $b_j$.

Given a set $S$ of $k$ sequences to be aligned, we would like
to analogously represent their residue alignment affinity
by a $k$-partite weighted graph. It may be obtained by joining
pairwise alignment graphs for all pairs of $S$-sequences.
However, separate computation of edge weights for each 
pair of sequences does not exploit information included 
in the remaining alignments. Thus we decided to address 
this problem with a so called consistency transformation 
[4,7], successfully used in progressive methods. 



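In order to incorporate correspondence with residues
from other sequences, MSARC re-estimates the residue
alignment affinity according to the following formula:

$$P'(a_i \sim b_j) = \frac{\sum_{c \in S} w_{ac} w_{cb} \sum_{c_k \in c} P(a_i \sim c_k)\, P(c_k \sim b_j)}{\sum_{c \in S} w_{ac} w_{cb}} \tag{11}$$

where $w_{xy}$ are weights specifying the relative contribution
to the transformation of a sequence pair $xy$.

If $P_{ab}$ is a matrix of current residue alignment affinities
for sequences $a$ and $b$, the matrix form equivalent
transformation is given by

$$P'_{ab} = \frac{\sum_{c \in S} w_{ac} w_{cb}\, P_{ac} \cdot P_{cb}}{\sum_{c \in S} w_{ac} w_{cb}} \tag{12}$$

where $\cdot$ stands for matrix multiplication.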

MSARC allows for two options of weight assignments.
In the first one all the weights are set to 1 and the above
formula simplifies to the following:

$$P'_{ab} = \frac{1}{|S|} \sum_{c \in S} P_{ac} \cdot P_{cb} \tag{13}$$

It results in the variant of consistency transformation used
in Probalign [8] and ProbCons [7].
In the second option weights are calculated according to
the following formula:

$$w_{ab} = \frac{\sum_{i=1}^{|a|} \sum_{j=1}^{|b|} P(a_i \sim b_j)}{\min(|a|, |b|)} \tag{14}$$



The idea behind the above formula is that the sum of a
row/column of a matrix $P_{ab}$ yields the probability that
the corresponding residue is aligned to one in the other
sequence (not a gap). If sequences $a$ and $b$ are similar,
alignments with fewer gaps are preferred, so (at least for
the shorter sequence) most of the sums are close to 1.
Consequently, $w_{ab}$ is close to 1 as well. On the other
hand, weights are much closer to 0 for pairs of dissimilar
sequences.

Thus $w_{ab}$ measures the similarity of sequences $a$ and
$b$. Therefore sequences $c$ that are similar to $a$ and $b$
contribute to the transformed affinities more significantly than others.

The consistency transformation may be iterated any 
number of times, but excessive iterations blur the struc- 
ture of residue affinity. Following Probalign [8] and Prob- 
Cons [7], MSARC performs two iterations by default. 
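
One iteration of the consistency transformation can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not MSARC code: dense list-of-lists matrices replace the sparse ones, the names `consistency_step`, `pair_weight`, `matmul` and `identity` are hypothetical, each sequence is treated as perfectly aligned with itself (identity matrix, weight 1), and the weighted variant is assumed to normalize by the sum of the products w_ac * w_cb.

```python
def matmul(X, Y):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def pair_weight(P):
    """Total alignment probability divided by the length of the shorter sequence."""
    return sum(map(sum, P)) / min(len(P), len(P[0]))

def consistency_step(P, lengths, weighted=False):
    """P[(a, b)] is the |a| x |b| affinity matrix for every ordered pair a != b."""
    seqs = sorted(lengths)
    mat = lambda x, y: identity(lengths[x]) if x == y else P[(x, y)]
    w = {(x, y): pair_weight(mat(x, y)) if weighted else 1.0
         for x in seqs for y in seqs}
    out = {}
    for a in seqs:
        for b in seqs:
            if a == b:
                continue
            acc = [[0.0] * lengths[b] for _ in range(lengths[a])]
            norm = 0.0
            for c in seqs:  # route the a-b affinity through every sequence c
                coef = w[(a, c)] * w[(c, b)]
                prod = matmul(mat(a, c), mat(c, b))
                norm += coef
                for i, row in enumerate(prod):
                    for j, v in enumerate(row):
                        acc[i][j] += coef * v
            out[(a, b)] = [[v / norm for v in row] for row in acc] if norm > 0 else acc
    return out
```

With only two sequences the step leaves the affinities unchanged, which is a convenient sanity check; with three or more sequences, affinities supported by intermediate sequences are reinforced.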

Residue clustering 

Columns of any multiple alignment form a partition of 
the set of sequence residues. The main idea of MSARC is 
to reconstruct the alignment by clustering an alignment 
graph into columns. The clustering method must satisfy 
constraints imposed by alignment structure. First, each 
cluster may contain at most one residue from a single 
sequence. Second, the set of all clusters must be orderable 
consistently with sequence orders of their residues. Viola- 
tion of the first constraint will be called ambiguity, while 
violation of the second one - conflict (see Figure 3). 

Towards this objective, MSARC applies top-down hier- 
archical clustering (see Figure 2). Within this approach, 
the alignment graph is recursively split into two parts until 
no ambiguous cluster is left. Each partition step results 
from a single cut through all sequences, so clusterings 
are conflict-free at each step of the procedure. Conse- 
quently, the final clustering represents a proper multiple 
alignment. 

Optimal clustering is expected to maximize residue 
alignment affinity within clusters and minimize it between 
them. Therefore, the partition selection in recursive steps 
of the clustering procedure should minimize the sum of 
weights of edges cut by the partition. This is in fact the 
objective of the well-known problem of graph partition- 
ing, i.e. dividing graph nodes into roughly equal parts such 
that the sum of weights of edges connecting nodes in 
different parts is minimized. 

The Fiduccia-Mattheyses algorithm [13] is an efficient
heuristic for the graph partitioning problem. After
selecting an initial, possibly random partition, it calculates
for each node the change in cost caused by moving
it between parts, called gain. Subsequently, single nodes
are greedily moved between partitions based on the maximum
gain and gains of remaining nodes are updated. The
process is repeated in passes, where each node can be
moved only once per pass. The best partition found in a
pass is chosen as the initial partition for the next pass. The
algorithm terminates when a pass fails to improve the partition.
Grouping single moves into passes helps the algorithm
to escape local optima, since intermediate partitions
in a pass may have negative gains. An additional balance
condition is enforced, disallowing movement from a partition
that contains less than a minimum desired number
of nodes.

Figure 3 Clusterings inconsistent (left and middle) and
consistent (right) with the alignment structure.

The Fiduccia-Mattheyses algorithm needs to be modified in
order to deal with alignment graphs. Mainly, residues are 
not moved independently; since the graph topology has to 
be maintained, moving a residue involves moving all the 
residues positioned between it and a current cut point on 
its sequence. This modification implies further changes 
in the design of data structures for gain processing. Next, 
the sizes of parts in considered partitions cannot differ 
by more than the maximum cluster size in a final clus- 
tering, i.e., the number of aligned sequences. This choice 
implies minimal search space containing partitions con- 
sistent with all possible multiple alignments. In the initial 
partition sequences are cut in their midpoints. 
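
The top-down clustering can be illustrated with a simplified Python sketch. It deliberately swaps the Fiduccia-Mattheyses passes for a plain greedy descent that shifts one cut point by one residue at a time, so it shows the constraints (one cut per sequence, midpoint initialization, part sizes differing by at most the number of sequences) rather than the actual heuristic. The names `cut_cost`, `split` and `cluster` and the edge representation are hypothetical.

```python
def cut_cost(edges, part, cut):
    """Sum of weights of edges inside `part` whose endpoints fall on different sides."""
    cost = 0.0
    for ((s, i), (t, j)), w in edges.items():
        inside = (s in part and t in part
                  and part[s][0] <= i < part[s][1]
                  and part[t][0] <= j < part[t][1])
        if inside and (i < cut[s]) != (j < cut[t]):
            cost += w
    return cost

def split(edges, part, n_seqs):
    """Split `part` (seq -> half-open residue interval) by one cut per sequence."""
    cut = {s: lo + (hi - lo) // 2 for s, (lo, hi) in part.items()}  # midpoints

    def feasible(c):
        left = sum(c[s] - part[s][0] for s in part)
        right = sum(part[s][1] - c[s] for s in part)
        return left > 0 and right > 0 and abs(left - right) <= n_seqs

    best = cut_cost(edges, part, cut)
    improved = True
    while improved:  # greedy descent, a stand-in for Fiduccia-Mattheyses passes
        improved = False
        for s, (lo, hi) in part.items():
            for step in (-1, 1):
                k = cut[s] + step
                if not lo <= k <= hi:
                    continue
                trial = dict(cut)
                trial[s] = k
                if feasible(trial):
                    c = cut_cost(edges, part, trial)
                    if c < best - 1e-12:
                        cut, best, improved = trial, c, True
    left = {s: (lo, cut[s]) for s, (lo, hi) in part.items()}
    right = {s: (cut[s], hi) for s, (lo, hi) in part.items()}
    return left, right

def cluster(edges, part, n_seqs):
    """Recursively split until every part holds at most one residue per sequence."""
    if all(hi - lo <= 1 for lo, hi in part.values()):
        column = {s: lo for s, (lo, hi) in part.items() if hi > lo}
        return [column] if column else []
    left, right = split(edges, part, n_seqs)
    return cluster(edges, left, n_seqs) + cluster(edges, right, n_seqs)
```

Because every split is a single cut through all sequences, the returned columns are already ordered from left to right and cannot conflict with the sequence order.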

The Fiduccia-Mattheyses heuristic may be optionally 
extended with a multilevel scheme [19]. In this approach 
increasingly coarse approximations of the graph are cre- 
ated by an iterative process called coarsening. At each 
iteration step selected pairs of nodes are merged into sin- 
gle nodes. Adjacent edges are merged accordingly and 
weighted with sums of original weights. The final coarsest
graph is partitioned using the Fiduccia-Mattheyses
algorithm. Then the partition is projected back to the orig- 
inal graph through the series of uncoarsening operations 
(see Figure 4), each of which is followed by a Fiduccia- 
Mattheyses based refinement. Because the last refinement 
is applied to the original graph, the multilevel scheme in 
fact reduces the problem of selecting an initial partition to 
the problem of selecting pairs of nodes to be merged. In 
alignment graphs only neighboring nodes can be merged, 
so MSARC just merges consecutive pairs of neighboring 
nodes (see Figure 5). 
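
A single coarsening step of this kind, together with the projection of a coarse cut back to the finer graph, might look as follows in Python. The sketch assumes residues are indexed from 0 within each sequence and that a cut is given as one position per sequence; the names `coarsen` and `project_cut` are hypothetical.

```python
from collections import defaultdict

def coarsen(edges, lengths):
    """Merge residues 2k and 2k+1 of every sequence; sum the weights of merged edges."""
    coarse = defaultdict(float)
    for ((s, i), (t, j)), w in edges.items():
        coarse[((s, i // 2), (t, j // 2))] += w
    coarse_lengths = {s: (n + 1) // 2 for s, n in lengths.items()}
    return dict(coarse), coarse_lengths

def project_cut(coarse_cut, lengths):
    """Map a cut of the coarse graph back to residue positions of the finer graph."""
    return {s: min(2 * k, lengths[s]) for s, k in coarse_cut.items()}
```

The projected cut is only a starting point; in the multilevel scheme it would then be refined on the finer graph.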

Refinement 

An example of alignment columns produced by residue 
clustering can be seen in Figure 6(ab). Presented align- 
ments contain surprisingly many spaces, especially in 
their right parts. Some of them are clearly superfluous, 
e.g. in both alignments there are 3 consecutive columns 
near the right end, each consisting of 4 spaces and 1 
G-nucleotide occupying a different row. 

Therefore we decided to add a refinement step, follow- 
ing the method used in ProbCons [7]. Sequences are split 
into two groups and the groups are pairwise re-aligned. 
Re-alignment is performed using the Needleman-Wunsch 
algorithm with the score for each pair of positions defined 
as the sum of posterior probabilities for all non-gap pairs 
and zero gap-penalty. First each sequence is re-aligned 
with the remaining sequences, since such division is very 
efficient in removing superfluous spaces. Next, several 
randomly selected sequence subsets are re-aligned against 
the rest. 
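
The re-alignment step can be illustrated for the simplest case of two single sequences: a Needleman-Wunsch recursion with zero gap penalty that maximizes the sum of posterior probabilities of aligned pairs. For two groups of sequences the cell score would be the sum of posteriors over all non-gap residue pairs of the two columns; that extension is omitted here. The function name `max_posterior_alignment` is hypothetical.

```python
def max_posterior_alignment(P):
    """Return (aligned index pairs, total score) for a posterior matrix P."""
    m, n = len(P), len(P[0])
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j - 1] + P[i - 1][j - 1],  # align a_i with b_j
                          D[i - 1][j],                        # a_i against a gap
                          D[i][j - 1])                        # b_j against a gap
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:  # traceback, preferring matches with positive posterior
        if P[i - 1][j - 1] > 0 and D[i][j] == D[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1], D[m][n]
```

Since gaps are free, residues without a positive-probability partner are simply left unaligned, which is what removes the superfluous single-residue columns.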



Figures 6(cd) show the results of refining the align- 
ments from Figures 6(ab). Refinement removed superflu- 
ous spaces from the clustering process and optimized the 
alignment. Note that the final post-refinement alignments 
turned out to be the same for both Fiduccia-Mattheyses 
and multilevel method of graph partitioning. 

Löytynoja and Goldman argue in [3] that progressive
methods tend to force alignments of non-homologous
sequence fragments inserted in corresponding locations
of aligned sequences. This tendency leads to systematic
errors of the downstream analyses in phylogenetic pipelines,
including overestimation of substitution and deletion
events. Unfortunately, iterative refinement may be
one possible source of such effects. Therefore the number
of iterations in the subset re-alignment step in MSARC is
adjustable; in particular, the whole step may be turned off.

Computational complexity 

Let $n$ denote the number of sequences to align and let $l$ be
their maximum length. Both time and space complexities
of stochastic alignment are $O(n^2 l^2)$.

Computations in the other steps use data structures for
sparse matrices, so the complexity depends on the number
$c$ of non-zero values per row/column. This number
depends on the cutoff parameter $t_c$ (entries $< t_c$ are set to
0), namely $c \le 1/t_c$. However, we observe that $c$ tends to
be much lower than this bound, e.g. $c$ rarely exceeds 5 for
the default $t_c = 0.01$.
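
The effect of the cutoff can be sketched in a few lines of Python. The dictionary-per-row representation and the name `sparsify` are assumptions made for illustration only; the bound on the number of stored entries follows because each row of posterior probabilities sums to at most 1 and every kept entry is at least the cutoff.

```python
def sparsify(P, t_c=0.01):
    """Keep only entries >= t_c; each row becomes a {column: probability} dict."""
    return [{j: p for j, p in enumerate(row) if p >= t_c} for row in P]
```

For instance, a row `[0.5, 0.005, 0.3]` is stored as `{0: 0.5, 2: 0.3}`, and no row can hold more than 1/t_c entries.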

The MSARC implementation of consistency transformation
requires $O(n^3 c^2 l)$ time. Space complexity of this and the
remaining steps is dominated by sparse matrices and
equals $O(n^2 c l)$.




Figure 4 An example of the coarsening of a graph, the partitioning of the coarse graph, and the subsequent uncoarsening of the
partitioned coarse graph (without a refinement step after each iteration of uncoarsening). Pairs of nodes selected for merging in each step
of coarsening are highlighted. Initial node and edge weights are all one. Node size and edge width, and the nearby number values indicate the
weights after merging.
The time complexity of one pass of the Fiduccia-
Mattheyses algorithm on whole sequences is 0(n^cP'). 
We observe that the algorithm converges after very few 
passes, but it is hard to prove a reasonable asymptotic 
bound. The complexity of the whole clustering is asymptotically
equal to the complexity of the main partition
step.

The time complexity of iterative refinement belongs to 
the class 0(n^cfi), 

Results 

Benchmark data and methodology 

MSARC was tested against the BAliBASE 3.0 benchmark
database [12]. It contains manually refined reference protein
alignments based on 3D structural superpositions.
Each alignment contains core-regions that correspond 
to the most reliably alignable sections of the alignment. 
Alignments are divided into five sets designed to evaluate 
performance on varying types of problems: 



RV1X Equidistant sequences with two different levels of conservation
    RV11 very divergent sequences (< 20% identity)
    RV12 medium to divergent sequences (20-40% identity)
RV20 Families aligned with a highly divergent "orphan" sequence
RV30 Subgroups with < 25% residue identity between groups
RV40 Sequences with N/C-terminal extensions
RV50 Internal insertions

BAliBASE 3.0 also provides a program comparing given 
alignments with a reference one. Alignments are scored
according to two metrics: a sum-of-pairs (SP) score showing
the ratio of residue pairs that are correctly aligned,
and a total column (TC) score showing the ratio of correctly
aligned columns. Both scores can be applied to full
sequences or just the core-regions. 
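
The two metrics can be sketched in Python under simplifying assumptions: alignments are lists of equal-length gapped strings with `-` as the gap symbol, all columns are scored (no core-region mask), and a reference column counts for TC only if it contains at least two residues and appears identically, gaps included, in the test alignment. The official BAliBASE scoring program may treat gaps and core regions differently; the function names here are hypothetical.

```python
def columns(aln):
    """Describe each column by the residue index it holds per sequence (None = gap)."""
    counters = [0] * len(aln)
    cols = []
    for k in range(len(aln[0])):
        col = []
        for s, row in enumerate(aln):
            if row[k] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols

def aligned_pairs(cols):
    """Set of residue pairs (seq s, index i, seq t, index j) sharing a column."""
    out = set()
    for col in cols:
        for s in range(len(col)):
            for t in range(s + 1, len(col)):
                if col[s] is not None and col[t] is not None:
                    out.add((s, col[s], t, col[t]))
    return out

def sp_tc(test, ref):
    """Return (SP, TC) of a test alignment with respect to a reference alignment."""
    ref_cols, test_cols = columns(ref), columns(test)
    ref_pairs = aligned_pairs(ref_cols)
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs) if ref_pairs else 0.0
    scored = [c for c in ref_cols if sum(x is not None for x in c) >= 2]
    test_set = set(test_cols)
    tc = sum(c in test_set for c in scored) / len(scored) if scored else 0.0
    return sp, tc
```

A single shifted sequence can leave SP fairly high while driving TC to zero, which is why the two scores are reported together.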




Figure 6 Example visualization of the alignment produced by the graph partitioning methods alone (ab) and graph partitioning followed
by refinement (cd). Panels: (a) Fiduccia-Mattheyses partitioning; (b) multilevel partitioning; (c) refined Fiduccia-Mattheyses partitioning;
(d) refined multilevel partitioning. Residue colors reflect how well the column is aligned based on residue match probabilities (darker is better). Partition cuts are
colored to show the order of partitioning with darker cuts being performed earlier.
We decided to present results based on core-region
scores only, since the corresponding sections of the ref- 
erence alignments are most reliable. Moreover, results for 
full sequence scores are very similar. 

Benchmarking MSARC variants 

Two steps of the MSARC algorithm, stochastic alignment
and iterative refinement, follow the respective steps in
Probalign [8]. Therefore we decided to set the
related parameters to Probalign's defaults. Namely,
MSARC was run with the Gonnet 160 similarity matrix [20],
gap penalties of -22, -1 and 0 for gap open, extension
and terminal gaps respectively, β = 0.2, a cut-off value for
posterior probabilities of 0.01 (values smaller than the cut-off
are set to 0 and operations designed for sparse matrices
are used in order to speed up computations), two iterations
of the consistency transformation and 100 iterations
of iterative refinement.

On the other hand, we decided to evaluate three param- 
eters that seem to be crucial for steps specific for MSARC 
approach. First, residue clustering may be performed with 
basic or multilevel Fiduccia-Mattheyses algorithm. Sec- 
ond, weighted or unweighted consistency transformation 
may be applied. Third, stochastic pairwise alignment may 
be based on equations (5)-(8) (i.e. stochastic version of 
classical Gotoh algorithm) or equations (6) and (7) may 
be replaced with equations (9) and (10), respectively. The 
modified formula disallows consecutive insertions and 
deletions, as is done in Probalign and MSAProbs. 

Various combinations of the above options were tested 
on the BAliBASE sequences. Results are presented in 
Table 1. The variant with neighboring insertions and 



deletions allowed, weighted consistency transformation 
and residue clustering with basic Fiduccia-Mattheyses 
algorithm has the best overall results, so it was selected for 
comparison with other methods. However, the differences 
are rather marginal. 

Comparison to other aligners 

MSARC results were compared to CLUSTAL Ω [1,21]
ver. 1.1.0, DIALIGN-T [9] ver. 0.2.2, DIALIGN-TX [22] 
ver. 1.0.2, MAFFT [6] ver. 6.903, MUSCLE [5] ver. 3.8.31, 
MSAProbs [18] ver. 0.9.7, Probalign [8] ver. 1.4, ProbCons 
[7] ver. 1.12, T-Coffee [4] ver. 9.02, FSA [10] ver. 1.15.7 
and PicXAA [11] ver. 1.03. All the programs were executed 
with their default parameters. 

Table 2 shows the SP and TC scores obtained by the 
alignment algorithms on the BAliBASE 3.0 benchmark. 
The overall results show that MSARC and PicXAA sub- 
stantially outperform other non-progressive methods - 
DIALIGN-T and FSA have SP scores lower by ~ 10 and 
TC scores lower by ~ 15. Furthermore, MSARC and 
PicXAA achieve accuracy similar to the progressive meth- 
ods MSAProbs and Probalign - the ranges of SP and TC 
scores of all four programs are 0.2 and 3.6, respectively. 

The differences between best programs are not signif- 
icant in most benchmark series (see Table 3) and corre- 
spond to their structures - MSARC and PicXAA have 
the best results for test series RV11 and RV40, and the
worst performance on RV30. Distances in RV30 fami- 
lies are particularly well represented by guide trees (low 
similarity between highly conserved subgroups) and pro- 
gressive methods can benefit from it. On the other hand, 
series RV11 contains highly divergent sequences for which



Table 1 Evaluation of MSARC variants

MSARC variant                                SP/TC scores
Alt. indels  Weighted  Multilevel  Score     All    RV11   RV12   RV20   RV30   RV40   RV50
yes          yes       no          SP        87.6   69.9   94.5   92.5   83.7   93.2   88.7
                                   TC        57.1   46.3   85.7   39.2   47.2   62.3   51.6
yes          yes       yes         SP        87.6   69.7   94.5   92.5   83.6   93.2   88.7
                                   TC        57.0   46.5   85.8   39.0   46.9   61.8   51.9
yes          no        no          SP        87.5   69.3   94.4   92.5   83.7   93.0   88.6
                                   TC        56.6   45.5   85.6   39.6   47.6   61.2   49.6
yes          no        yes         SP        87.5   69.6   94.5   92.5   83.4   93.1   88.4
                                   TC        56.6   45.6   85.8   39.3   47.0   61.4   49.6
no           yes       no          SP        87.5   69.2   94.4   92.5   83.5   93.2   89.0
                                   TC        57.0   45.6   85.7   39.5   47.1   62.2   51.9
no           yes       yes         SP        87.5   69.2   94.4   92.5   83.7   93.2   88.7
                                   TC        57.1   46.2   85.6   39.2   47.7   62.4   51.6
no           no        no          SP        87.5   69.4   94.5   92.5   83.5   93.0   88.5
                                   TC        56.6   45.6   85.7   39.7   46.9   61.3   49.7
no           no        yes         SP        87.5   69.5   94.4   92.5   83.5   93.1   88.6
                                   TC        56.7   45.7   85.7   39.1   47.0   61.7   49.7

All the combinations of the following options are evaluated: (dis-)allowing for neighboring insertions and deletions in pairwise alignments, (not) weighting sequence
pairs in consistency transformation and (not) using multilevel scheme in residue clustering. Entries show the mean SP and TC scores for each alignment algorithm on
the whole BAliBASE 3.0 dataset and each of its series. All scores are multiplied by 100. Best results in each column are shown in bold.
Table 2 Comparison of multiple sequence alignment methods

SP/TC scores 

Aligner 



All 



RVll 



RV12 



RV20 



RV30 



RV40 



RV50 



BB40037 



Computation 
Time 



Non-progressive methods 

MSARC 



DIALIGN-T 

FSA 

PicXAA 



87.6 
57.1 



77.3 
42.8 



78.5 
42.1 



878 

59.4 



699 
463 



49.3 
25.3 



50.3 
26.9 



69.0 
463 



94.5 
85.7 



92.4 
81.8 



92.5 
39.2 



86.3 
29.2 



86.7 
18.7 



92.5 
41.6 



83.7 
47.2 



74.7 
34.9 



70.7 
27.6 



86.0 
59.8 



932 

62.3 



82.0 
45.2 



85.5 
46.2 



93.1 
624 



88.7 
51.6 



78.2 
39.8 



89.2 
53.0 



987 
700 



52.6 
0.0 



81.8 
30.0 



987 
700 



16:36:37 
1 :13:21 

35:15:34 
5:54:18 



Progressive methods 

CLUSTAL Ω

DIALIGN-TX 
MAFFT 
MSAProbs 
MUSCLE 
Probalign 
ProbCons 
T-Coffee 



84.0 


59.0 


55.4 


35.8 


78.8 


51.5 


44.3 


26.5 


86.7 


65.3 


58.4 


42.8 


878 


68.2 


607 


44.1 


81.9 


57.2 


47.5 


31.8 


87.6 


69.5 


58.9 


45.3 


86.4 


67.0 


55.8 


41.7 


85.7 


65.5 


55.1 


40.9 



90.6 
78.9 



89.2 
75.2 



93.6 
83.8 



946 
865 



91.5 
80.4 



946 

86.2 



94.1 
85.5 



90.2 
45.0 



87.9 
30.5 



92.5 
44.6 



928 
464 



92.6 
43.9 



91.7 
40.6 



91.4 
40.1 



86.2 
57.5 



76.2 
38.5 



85.9 
58.1 



865 
607 



81.4 
40.9 



85.3 
56.6 



84.5 
54.4 



83.7 
49.0 



90.2 
57.9 



83.6 
44.8 



91.5 
59.0 



92.5 
62.2 



86.5 
45.0 



92.2 
60.3 



90.3 
53.2 



89.2 
54.5 



86.2 
53.3 



82.3 
46.6 



908 
608 



83.5 
45.9 



88.7 
54.9 



89.4 
57.3 



89.4 
58.5 



61.2 
0.0 



52.8 
0.0 



56.4 
0.0 



59.5 
0.0 



48.4 
0.0 



54.2 
0.0 



59.3 
0.0 



12 : 15 

1 : 36 : 05 
54:04 

6:43:51 
23:32 

4:31 :41 

6 : 56 : 32 



50.9 
0.0 



13:53:02

Columns 2-9 show the mean SP and TC scores for each alignment algorithm on the whole BAliBASE 3.0 dataset, each of its series and
case BB40037. The last column presents total CPU computation time (hh:mm:ss). All scores are multiplied by 100. Best results in each column are shown in bold.



Table 3 Significance of differences in BAliBASE 3.0 SP/TC scores

Aligner      Score  RV11      RV12      RV20      RV30      RV40      RV50      Total

Non-progressive methods
DIALIGN-T    SP     +8.6e-8   +7.7e-9   +1.3e-7   +2.7e-6   +2.1e-9   +0.00098  +5.3e-36
             TC     +1.5e-6   +2.2e-8   +9.6e-5   +0.0024   +4.9e-8   +0.027    +3.6e-26
FSA          SP     +8.6e-8   +3.5e-6   +3.6e-8   +2.6e-6   +8.3e-9   +0.00081  +3.6e-34
             TC     +1.2e-6   +0.00012  +1.2e-6   +8.5e-6   +1.2e-6   +0.021    +3.5e-27
PicXAA       SP     +0.048    -(0.82)   -(0.055)  -2.8e-5   +(0.11)   -(0.063)  -0.0079
             TC     +(0.53)   -(0.98)   -0.018    -7.2e-6   -(0.052)  -(0.37)   -1.3e-6

Progressive methods
Clustal Ω    SP     +2.6e-7   +2.4e-5   +0.0048   -0.020    +2.2e-6   +0.017    +1.1e-13
             TC     +5.1e-5   +0.00019  -0.00054  -0.00060  +(0.16)   -(0.77)   +(0.30)
DIALIGN-TX   SP     +1.0e-7   +6.2e-8   +2.3e-7   +8.7e-6   +2.8e-9   +0.0017   +3.1e-34
             TC     +1.3e-6   +4.0e-7   +0.00040  +0.038    +1.3e-7   +(0.066)  +9.5e-23
MAFFT        SP     +0.0031   +0.00085  -(0.64)   -0.0009   +0.0005   -(0.072)  +0.028
             TC     +(0.11)   +0.005    -(0.052)  -0.0007   +(0.07)   -(0.062)  -(0.55)
MSAProbs     SP     +0.028    -(0.90)   -0.011    -0.00017  +(0.61)   -0.010    -0.020
             TC     +(0.23)   -(0.67)   -0.00032  -1.4e-5   +0.048    -0.0086   -5.9e-8
MUSCLE       SP     +7.3e-6   +2.8e-6   +0.00015  +(0.19)   +7.6e-9   +0.010    +2.9e-22
             TC     +0.00017  +0.00015  +(0.15)   +(0.52)   +2.8e-6   +(0.072)  +3.3e-12
Probalign    SP     +(0.67)   -(0.63)   -0.032    -0.0099   +(0.62)   -(0.18)   -0.019
             TC     +(0.52)   -(0.88)   -6.8e-5   -0.00056  +(0.060)  -(0.32)   -6.0e-6
ProbCons     SP     +0.021    +0.0042   +0.028    -(0.15)   +0.00026  -(0.12)   +0.00087
             TC     +0.037    +(0.19)   -(0.19)   -0.010    +0.022    -(0.17)   +(0.93)
T-Coffee     SP     +0.0024   +0.0017   +0.0075   -(0.29)   +9.7e-5   -0.048    +1.3e-5
             TC     +0.016    +0.013    -(0.51)   -(0.099)  +(0.29)   -0.026    +(0.70)

Entries show p-values indicating the significance of the mean difference of SP/TC scores between MSARC and other aligners as measured using the Wilcoxon
matched-pair signed-rank test. A + means that MSARC had a higher mean score while a - means MSARC had a lower mean score. Nonsignificant p-values (> 0.05) are
shown in parentheses.
a guide-tree is poorly informative, even if it represents the
correct phylogeny, and RV40 contains sequences with 
N/C-terminal extensions which may affect the accuracy of 
the estimated phylogeny. These sequence families expose 
progressive methods to guide-tree bias. 

We illustrate this observation with an example of test 
case BB40037. As is shown in column 9 of Table 2, MSARC
outperforms progressive methods by a large margin. TC
scores of zero mean that each alignment method
has shifted at least one sequence from its correct posi- 
tion relative to the other sequences. Figure 7 presents the 
structure of the reference alignment, as well as alignments 
generated by MSARC, Probalign and MSAProbs. The 
large family of red, orange and yellow colored sequences 
near the bottom has been misaligned by the progressive 
methods. The reason for this is more visible in Figure 8, 
where sequences in alignments are reordered according to 
related guide-trees. 

Probalign aligns separately the first half of the sequences 
(blue and green) and the second half of the sequences 
(from yellow to red). Next, the prefixes of the second 
group are aligned with the suffixes of the first group, 
propagating an error within a yellow sub-alignment. 



MSAProbs aligns separately the dark blue, light blue
and red sequences. Next the blue sub-alignments are 
aligned together. The resulting alignment has erroneously 
inserted gaps near the right ends of dark blue sequences. 
This error is propagated in the next step, where the suffix 
of the blue alignment is aligned with the prefix of the red 
alignment. Finally, the single violet sequence is added to 
the alignment, splitting it in two. 

For both programs, alignment errors introduced in the 
earlier steps are propagated to the final alignment. On the 
other hand, the non-progressive strategy used in MSARC 
yields a reasonable approximation of the reference align- 
ment (see Figure 7(ab)). 

Conclusions 

The progressive principle has dominated multiple alignment
algorithms for nearly 20 years. Throughout this
time, many groups have dedicated their effort to refining
its accuracy to the current state. Other approaches
were neglected due to high computational complexity
and/or unsatisfactory quality. However, several
non-progressive methods were proposed recently. Two of them,
PicXAA and MSARC, proved to be competitive with the
best progressive approaches. Moreover, both programs
outperform progressive methods on sequence sets with
evolutionary distances that are difficult to represent by a
phylogenetic tree.

Figure 7 Visualization of reference (a) and reconstructed (bcd) alignments for test case BB40037. Panels: (a) BAliBASE; (b) MSARC;
(c) Probalign; (d) MSAProbs. In all alignments sequences are ordered
accordingly. Each sequence is colored based on the evolutionary distance to its neighbors in a phylogenetic tree, such that families of related
sequences have similar colors. Trees for (a) and (b) are computed with the PhyML 3.0 program [23], using the maximum parsimony method. Trees
for (c) and (d) are the guide-trees used by those aligners.

Figure 8 Guide trees (ac) and alignment visualizations (bd) for test case BB40037 and programs Probalign (ab) and MSAProbs (cd). Tree
branches and aligned sequences are colored based on the evolutionary distances to their neighbors, as computed from the guide-trees used during
alignment. Sequences in alignments are ordered following their order in trees, so related sequences have similar color and are positioned together.

Beyond their algorithmic novelty, the non-progressive
approaches to multiple alignment are interesting prepro- 
cessing tools for phylogeny reconstruction pipelines. The 
objective of these procedures is to infer the structure of 
a phylogenetic tree from a given sequence set. Multiple 
alignment is usually the first pipeline step. When align- 
ment is guided by a tree, the reconstructed phylogeny is 
biased towards this tree. In order to minimize this effect, 
some phylogenetic pipelines alternately optimize a tree 
and an alignment [24-26]. The unbiased alignment pro- 
cess of MSARC may simplify this procedure and improve 
the reconstruction accuracy, especially in the most prob- 
lematic cases. 

MSARC also has potential for quality improvements.
Alternative methods of computing residue alignment
affinities could be used to improve the accuracy
of both MSARC and Probalign based methods. Other 
approaches to alignment graph partitioning may also lead 
to improvements in the accuracy of MSARC, for example 
a better method of pairing residues for multilevel coars- 
ening than the currently used naive consecutive neighbors 
merging. 

The main disadvantage of MSARC is its computational
complexity, especially in the case of the multilevel scheme
variant (MSARC is about 2.5 times slower than MSAProbs, and the
MSARC variant with the multilevel scheme is slower still).
This is the cost of avoiding the progressive approach.



Endnote 

Our notion of alignment graph differs slightly from
that of Kececioglu [27]: removing edges between
clusters transforms the former into the latter.